channel size
A Model, training, and dataset details
All models are trained end-to-end with the Gumbel-Softmax [
Models are trained on a single Titan Xp GPU on an internal cluster. Training time is typically 6-8 hours on 4 CPUs and 32 GB of RAM. We train with batch size B = 128. As with ShapeWorld, RNN encoders and decoders are single-layer GRUs with hidden size 1024 and embedding size 500. For additional example games from both datasets, see Figure S1.
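As a rough illustration of the Gumbel-Softmax relaxation mentioned above, the following sketch draws one relaxed one-hot sample from a vector of logits. It is a minimal standard-library version written for this note, not the authors' training code; the function name `gumbel_softmax`, its `temperature` parameter, and the example logits are assumptions.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Draw one relaxed one-hot sample from the categorical defined by logits.

    Hypothetical helper for illustration only.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)).
        # u is clamped away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    random.seed(0)
    print(gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5))
```

Lower temperatures push the sample toward a discrete one-hot vector, while higher temperatures give smoother samples; in a trained model the same computation would be expressed in a framework with automatic differentiation so that gradients can flow through the sample.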
Neural Spectral Band Generation for Audio Coding
Choi, Woongjib, Kim, Byeong Hyeon, Lim, Hyungseob, Jang, Inseon, Kang, Hong-Goo
Spectral band replication (SBR) enables bit-efficient coding by generating high-frequency bands from the low-frequency ones. However, it relies only on coarse spectral features applied on top of a subband-wise signal replication, which limits its adaptability to diverse acoustic signals. In this paper, we explore the efficacy of a deep neural network (DNN)-based generative approach for coding the high-frequency bands, which we call neural spectral band generation (n-SBG). Specifically, we propose a DNN-based encoder-decoder structure that extracts and quantizes the side information related to the high-frequency components and generates those components given both the side information and the decoded core-band signals. The whole coding pipeline is optimized with generative adversarial criteria to enable the generation of perceptually plausible sound. In experiments using AAC as the core codec, we show that the proposed method achieves better perceptual quality than HE-AAC-v1 with much less side information.
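To make the idea of subband-wise replication with coarse spectral features concrete, here is a toy sketch: low-band magnitudes are copied into the high band, and each subband is rescaled to match a transmitted energy value. This is a simplified illustration under stated assumptions, not the HE-AAC algorithm and not the proposed n-SBG model; the function name `replicate_high_band` and the per-subband energy side information are hypothetical.

```python
import math


def replicate_high_band(low_band, target_energies):
    """Build a high band by copying low-band magnitudes subband by subband.

    low_band: magnitude spectrum of the decoded core band (list of floats).
    target_energies: coarse side information, one energy per high subband.
    Both the interface and the even subband split are assumptions.
    """
    n_sub = len(target_energies)
    if n_sub == 0 or len(low_band) % n_sub != 0:
        raise ValueError("low band must split evenly into subbands")
    width = len(low_band) // n_sub
    high_band = []
    for k, target in enumerate(target_energies):
        # Replicate: reuse the k-th low-band patch as the k-th high subband.
        patch = low_band[k * width:(k + 1) * width]
        energy = sum(m * m for m in patch)
        # Envelope adjustment: scale the patch to the transmitted energy.
        gain = math.sqrt(target / energy) if energy > 0 else 0.0
        high_band.extend(gain * m for m in patch)
    return high_band


if __name__ == "__main__":
    print(replicate_high_band([3.0, 4.0, 1.0, 0.0], [1.0, 4.0]))
```

Because the fine structure of each high subband is fixed by the copied patch and only one gain per subband is transmitted, the result can only follow the target envelope coarsely, which is the limitation the abstract attributes to SBR.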
Searching for Effective Preprocessing Method and CNN-based Architecture with Efficient Channel Attention on Speech Emotion Recognition
Speech emotion recognition (SER) classifies human emotions in speech with a computer model. Recently, performance in SER has steadily improved as deep learning techniques have been adopted. However, unlike many domains that use speech data, SER lacks sufficient training data. This causes the neural network to overfit during training, resulting in performance degradation. Successful emotion recognition therefore requires an effective preprocessing method and a model structure that uses its weight parameters efficiently. In this study, we propose using eight dataset versions with different frequency-time resolutions to search for an effective emotional speech preprocessing method. We propose a 6-layer convolutional neural network (CNN) model with efficient channel attention (ECA) to pursue an efficient model structure. In particular, well-positioned ECA blocks can improve channel feature representation with only a few parameters. On the interactive emotional dyadic motion capture (IEMOCAP) dataset, increasing the frequency resolution in preprocessing emotional speech can improve emotion recognition performance. Also, ECA after the deep convolution layer can effectively increase channel feature representation. Consequently, the best result (79.37 UA, 79.68 WA) can be obtained, exceeding the performance of previous SER models. Furthermore, to compensate for the lack of emotional speech data, we experiment with a multiple-preprocessing augmentation method that expands the trainable data by preprocessing each sample with all of the different settings. In this experiment, we achieve the highest result (80.28 UA, 80.46 WA).
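The channel attention step described above can be sketched in a few lines: pool each channel to one value, run a small 1D convolution across neighbouring channels, and use the sigmoid of the result to rescale the channel. The version below is a plain-Python illustration of that general ECA idea, not the paper's model; the function name `eca`, the flattened feature layout, and the fixed example kernel (which would be learned in practice) are assumptions.

```python
import math


def eca(features, kernel):
    """Apply an ECA-style gate to a feature map.

    features: list of channels, each a list of activations (flattened H*W).
    kernel: odd-length list of 1D convolution weights shared by all channels;
            these few weights are the only parameters of the block.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    # Squeeze: global average pooling gives one descriptor per channel.
    pooled = [sum(ch) / len(ch) for ch in features]
    # Local cross-channel interaction: zero-padded 1D convolution over channels.
    half = len(kernel) // 2
    gates = []
    for c in range(len(pooled)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = c + j - half
            if 0 <= idx < len(pooled):
                acc += w * pooled[idx]
        gates.append(1.0 / (1.0 + math.exp(-acc)))
    # Excite: rescale every activation in a channel by that channel's gate.
    scaled = [[g * x for x in ch] for g, ch in zip(gates, features)]
    return scaled, gates


if __name__ == "__main__":
    fmap = [[1.0, 3.0], [0.0, 0.0], [2.0, 2.0]]
    print(eca(fmap, [0.0, 1.0, 0.0]))
```

The parameter count equals the kernel length regardless of the number of channels, which is why such a block adds only a few parameters to the network.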
G-RepsNet: A Fast and General Construction of Equivariant Networks for Arbitrary Matrix Groups
Basu, Sourya, Lohit, Suhas, Brand, Matthew
Group equivariance is a strong inductive bias useful in a wide range of deep learning tasks. However, constructing efficient equivariant networks for general groups and domains is difficult. Recent work by Finzi et al. (2021) directly solves the equivariance constraint for arbitrary matrix groups to obtain equivariant MLPs (EMLPs). However, this method does not scale well, and scaling is crucial in deep learning. Here, we introduce Group Representation Networks (G-RepsNets), a lightweight equivariant network for arbitrary matrix groups with features represented using tensor polynomials. The key intuition for our design is that using tensor representations in the hidden layers of a neural network, along with simple inexpensive tensor operations, can lead to expressive universal equivariant networks. We find G-RepsNet to be competitive with EMLP on several tasks with group symmetries such as O(5), O(1, 3), and O(3), with scalars, vectors, and second-order tensors as data types. On image classification tasks, we find that G-RepsNet using second-order representations is competitive with, and often even outperforms, sophisticated state-of-the-art equivariant models such as GCNNs (Cohen & Welling, 2016a) and E(2)-CNNs (Weiler & Cesa, 2019). To further illustrate the generality of our approach, we show that G-RepsNet is competitive with G-FNO (Helwig et al., 2023) and EGNN (Satorras et al., 2021) on solving PDEs and N-body predictions, respectively, while being efficient.
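The equivariance property at the heart of this abstract, f(g x) = g f(x), can be checked numerically on a small example. The sketch below mixes 2D vector channels with scalar weights and gates them by a rotation-invariant quantity, then confirms that rotating the input and rotating the output give the same result. It is a generic illustration of an equivariant layer for rotations in O(2), not the G-RepsNet architecture; the names `rotate` and `equivariant_layer` and all numeric values are assumptions.

```python
import math


def rotate(vec, theta):
    """Apply a 2D rotation (an element of O(2)) to a vector."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]]


def equivariant_layer(vectors, weights):
    """Mix vector channels with scalar weights, then gate by an invariant.

    vectors: list of input channels, each a 2D vector.
    weights: weights[o][i] is the scalar weight from input i to output o.
    """
    # Squared norms are unchanged by orthogonal transforms, so the gate
    # computed from them is invariant.
    invariant = sum(v[0] ** 2 + v[1] ** 2 for v in vectors)
    gate = math.tanh(invariant)
    out = []
    for row in weights:
        # A scalar-weighted sum of vectors commutes with any linear group action.
        mixed = [sum(w * v[d] for w, v in zip(row, vectors)) for d in range(2)]
        out.append([gate * m for m in mixed])
    return out


if __name__ == "__main__":
    vecs = [[1.0, 0.5], [-0.3, 2.0]]
    w = [[0.7, -1.2], [0.1, 0.4], [2.0, 0.0]]
    theta = 0.9
    transformed_first = equivariant_layer([rotate(v, theta) for v in vecs], w)
    transformed_last = [rotate(v, theta) for v in equivariant_layer(vecs, w)]
    print(transformed_first)
    print(transformed_last)
```

The two printed lists agree up to floating-point error, since the layer only combines vectors linearly and applies its nonlinearity to an invariant scalar.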